PC4

Author

Jack Walton, Calista Kerins, Dylan Schmidt, Jhon Chavez-Matul

1 Project Proposal + Data

1.1 Quantitative Variables

  • Variable 1: Happiness score (WHR)*

  • Variable 2: Life Expectancy*

*Measured each year

1.2 Detailed Data Description

Happiness Score Dataset: This data set contains national average life-satisfaction scores on a scale known as the Cantril life ladder, measured across nations over multiple years. Respondents imagine a ladder with steps ranging from 0 at the bottom (representing the worst possible life) to 10 at the top (representing the best possible life) and indicate where they feel they stand. Gapminder has rescaled the 0 to 10 ladder to 0 to 100, so that scores can be read as percentages. The data comes from the World Happiness Report, which is based on the Gallup World Poll. It covers 2005 to 2020.


Life Expectancy Dataset: This data set contains life expectancy across different nations over time, measured as the number of years a newborn infant would live if the current mortality rates at different ages were to stay the same throughout its life. The data is compiled by Gapminder and mainly comes from the Institute for Health Metrics and Evaluation and UN forecasts. It covers 1800 to 2100, with the later years being predictions.

Hypothesized Relationship

We hypothesize that countries with a higher average happiness score will also have a higher life expectancy. We assume that people who are happier try to stay healthy and take better care of themselves because they enjoy their lives.

1.3 Data Cleaning Process

  1. First, we used pivot_longer to pivot the tables, consolidating all the different year columns into one column named ‘Year’.

  2. Then, we removed all rows containing ‘NA’ values.

  3. Next, we did an inner join on the year and country columns (named time and name in the code below) to combine both tables, keeping only the country-year rows for which both data sets have a value.

library(tidyverse)

# read.csv() makes column names syntactic, which is where Happiness.score..WHR. comes from.
life_expectancies_by_country <- read.csv("Life Expectancy-Dataset-countries-etc-by-year.csv")
happiness_scores_by_country <- read.csv("Happiness-Dataset-countries-by-year.csv")

# Keep only the country-year pairs that appear in both tables.
full_dataset <- inner_join(life_expectancies_by_country,
                           happiness_scores_by_country,
                           by = c("time", "name")) |>
  rename(Happiness.score = Happiness.score..WHR.)
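
The code above reads files that already have one row per country and year, so the pivot and NA-removal steps are not shown. The following is a minimal sketch of all three cleaning steps on made-up data. The toy tables, their values, and the helper `to_long` are hypothetical, and the real source files may be laid out differently.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide-format tables: one row per country, one column per year.
happiness_wide <- tibble(
  name   = c("A", "B"),
  `2019` = c(70.1, NA),
  `2020` = c(71.3, 55.0)
)
life_wide <- tibble(
  name   = c("A", "B"),
  `2019` = c(80.2, 65.4),
  `2020` = c(80.5, 65.9)
)

# Hypothetical helper covering steps 1 and 2.
to_long <- function(wide, value_name) {
  wide |>
    # Step 1: collapse the year columns into a single integer column.
    pivot_longer(cols = -name, names_to = "time", values_to = value_name,
                 names_transform = list(time = as.integer)) |>
    # Step 2: drop rows with missing values.
    drop_na()
}

happiness_long <- to_long(happiness_wide, "Happiness.score")
life_long <- to_long(life_wide, "Life.expectancy")

# Step 3: keep only country-year pairs present in both tables.
toy_joined <- inner_join(life_long, happiness_long, by = c("time", "name"))
toy_joined
```

In this toy example, country B has no happiness score for 2019, so that country-year pair is dropped and three rows remain after the join.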

2 Linear Regression

Explanatory Variable (x): Happiness score

Response Variable (y): Life expectancy

2.1 Data Visualization

Let’s plot the data with the explanatory variable, happiness score, on the x-axis and the response variable, life expectancy, on the y-axis. We treat happiness as the explanatory variable because our hypothesis is that happiness affects life expectancy, rather than life expectancy affecting happiness.

# Scatter plot of happiness score (x) against life expectancy (y).
plot <- ggplot(data = full_dataset, aes(x = Happiness.score, y = Life.expectancy)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  labs(x = "Happiness score", y = "", subtitle = "Life Expectancy (Years)",
       title = "Relation Between Happiness and Life Expectancy")
print(plot)

Excluding the outliers, the points seem to follow a roughly linear trend. Now let’s separate the points by year and animate a similar plot over time.

library(gganimate)
library(gifski)

anim <- plot +
  transition_time(time) + 
  labs(title = "Relation Between Happiness and Life Expectancy: {frame_time}")

animate(anim, renderer = gifski_renderer("happiness_life_expectancy.gif"))

[Animated GIF: happiness score vs. life expectancy, one frame per year]

2.1.0.1 Fitting a simple linear regression model to check for a relationship:

First, we cleaned the data to ensure one x value and one y value per country. We chose to use the average values for life expectancy and happiness scores for each country across all years, in order to get a more complete picture.

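# Average each country's values across all available years.
clean_dataset <- full_dataset |>
  group_by(name) |>
  summarise(LifeExpectancy = mean(Life.expectancy),
            HappinessScore = mean(Happiness.score))

# Regress average life expectancy on average happiness score.
model <- lm(LifeExpectancy ~ HappinessScore, data = clean_dataset)

model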

Call:
lm(formula = LifeExpectancy ~ HappinessScore, data = clean_dataset)

Coefficients:
   (Intercept)  HappinessScore  
        39.972           0.586  

The model has the form:

\[ \text{Life Expectancy} = \beta_0 + \beta_1 \cdot \text{Happiness Score} \]

From the model output, we find that the estimated y-intercept \(\beta_0\) is \(39.972\) and the estimated slope \(\beta_1\) is \(0.586\). Thus, the estimated regression equation is:

\[ \text{Life Expectancy} = 39.972 + 0.586 \cdot \text{Happiness Score} \tag{1}\]

Equation (1) is the fitted regression line. It suggests that a one-point increase in a country's average happiness score (on the 0 to 100 scale) is associated with a 0.586-year increase in predicted average life expectancy.
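
As a worked example of reading Equation (1), the sketch below plugs a hypothetical happiness score of 60 into the fitted line. The function name `predict_life_expectancy` and the input values are illustrative assumptions, not part of the original analysis. The coefficients are the rounded estimates reported above.

```r
# Rounded coefficients copied from the model output above.
intercept <- 39.972
slope <- 0.586

# Hypothetical helper (not in the original analysis): evaluates Equation (1).
predict_life_expectancy <- function(happiness_score) {
  intercept + slope * happiness_score
}

# Predicted life expectancy for a hypothetical country scoring 60 out of 100.
predict_life_expectancy(60)   # about 75.13 years

# The difference between scores of 61 and 60 equals the slope.
predict_life_expectancy(61) - predict_life_expectancy(60)
```

Because the model was fit on country averages, a prediction like this is best read as a statement about a country's average life expectancy, not about any individual.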